Feature importance

A comprehensive report on feature importance and selection in machine learning

Introduction

An often overlooked but critical aspect of machine learning is model interpretability. While black-box models are acceptable in some settings, most businesses want to understand exactly how their models work so they can derive insights and improve their business based on those insights.

A good way to interpret a model is through feature importance. This project is about interpreting ML techniques using feature importance and selection.

Feature importance gives a relative ranking of the model's features. This is an important step for businesses that want to analyse which features matter most for their use case - and that ranking can matter more to them than the exact methods used for modelling.

More than the scores themselves, the relative ranking of features is what usually matters.

Why calculate feature importance?

  1. We often use the feature importance values to select the most important features for the model, giving a simpler model.
  2. Simpler models generalize better and can increase the model's accuracy.
  3. With fewer features, we can also train the models faster.

Note:

Below, I have tried out several different ways of calculating feature importance. Although feature importance and selection can work with any type of algorithm, I have restricted my experiments to a Random Forest with 100 trees, and I have used the Boston housing dataset (regression) as an example.

About the dataset:

The target of this dataset is the median price of Boston houses. The independent variables relate to rooms per dwelling, crime rate, and similar statistics for each house. There are 13 such features, and we try to rank their importance and select among them.

In [1]:
%run featimp.py

Spearman's correlation

Spearman's correlation coefficient is the Pearson correlation computed on the ranks of the two variables: the covariance of the rank variables divided by the product of their standard deviations. Its sign and magnitude indicate how positively or negatively a feature is correlated with the target variable. This is an easy way to start off EDA and interpret models.

How I Implemented this

To implement this, I have iterated over each column of training data and computed the Spearman correlation coefficient with respect to target variable.
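The per-column loop might look like the following sketch, using scipy's `spearmanr` (the function name `spearman_importances` is illustrative, not the one in featimp.py):

```python
# A minimal sketch of the per-column Spearman computation described above.
import pandas as pd
from scipy.stats import spearmanr

def spearman_importances(X: pd.DataFrame, y) -> pd.Series:
    """Spearman correlation of each feature with the target, ranked by |rho|."""
    corrs = {}
    for col in X.columns:                 # iterate over each training column
        rho, _ = spearmanr(X[col], y)     # rank correlation with the target
        corrs[col] = rho
    return pd.Series(corrs).sort_values(key=abs, ascending=False)
```

Sorting by absolute value keeps strongly negative correlations near the top of the ranking, since they are just as informative as positive ones.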

Below, I am calling my plot function for the boston dataset to get the Spearman correlation coefficient.

In [2]:
# get dataset
boston = load_boston()
y = boston.target
X = pd.DataFrame(boston.data, columns = boston.feature_names)
In [3]:
plot_spearman(X,y)

mRMR

mRMR (minimal-redundancy-maximal-relevance) is a feature selection technique that also takes into consideration the codependencies among the feature variables. It accounts for both 'relevance' to the target and 'redundancy' with respect to the other features (i.e., codependent features).

For this implementation I have used the Spearman correlation from above as the importance metric, plugged into the standard mRMR criterion: J(x_k) = relevance(x_k, target) - (1/|S|) * sum over x_j in S of redundancy(x_k, x_j). In other words, each candidate feature is scored by its relevance to the target minus its average redundancy with the already-selected features.

How I Implemented this

S is the set of selected features. This set is my output after iterating over all the columns in the dataset.

  1. Initially, S is an empty set.

  2. Using the above formula, we calculate the J value for each column x_k and choose the feature with maximum value to add to set S.

  3. The same process is repeated, and elements are added to S one by one as per the mRMR algorithm.

  4. The final set S has the order in which features were selected which is essentially like a rank of selection.
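The greedy selection above can be sketched as follows, using absolute Spearman correlation for both the relevance and redundancy terms (function names are illustrative, not those in featimp.py):

```python
# Hedged sketch of greedy mRMR:
#   J(x_k) = relevance(x_k, y) - (1/|S|) * sum over x_j in S of redundancy(x_k, x_j)
import pandas as pd
from scipy.stats import spearmanr

def mrmr_order(X: pd.DataFrame, y):
    def rho(a, b):
        r, _ = spearmanr(a, b)
        return abs(r)

    relevance = {c: rho(X[c], y) for c in X.columns}   # relevance to the target
    selected, remaining = [], list(X.columns)          # S starts empty
    while remaining:
        def j_score(c):
            if not selected:
                return relevance[c]
            redundancy = sum(rho(X[c], X[s]) for s in selected) / len(selected)
            return relevance[c] - redundancy
        best = max(remaining, key=j_score)             # argmax of J over candidates
        selected.append(best)                          # add to S one by one
        remaining.remove(best)
    return selected                                    # selection order = rank
```

Note that the first feature chosen is simply the most relevant one, since the redundancy term is zero while S is empty.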

In [4]:
mrmr(X,y)
Out[4]:
mRMR feature selection
0 LSTAT
1 CHAS
2 RM
3 TAX
4 PTRATIO
5 AGE
6 INDUS
7 NOX
8 CRIM
9 ZN
10 B
11 RAD
12 DIS

The features above are listed in the order the mRMR algorithm selected them (most valuable first). Based on this ordering, the top-k features can be selected.

Default RF Importance - Gini Drop

Here, I have tried the default feature importances from the Random Forest implementation to calculate the feature importance scores.

In [5]:
plot_rf(X,y)

The issue with the default importance

From the blog post at https://explained.ai/rf-importance/index.html, let's explore and understand the issue with the default importances in the RF package.

The package computes the Gini importance, which is biased in favour of high-cardinality (many-valued) variables. This is a major drawback of this method of computation. The values may be more consistent when the variables are normalized, though normalizing is not a usual practice for tree models.

Drop column importance

Drop column is a simple method that calculates the evaluation metric with and without a feature to compute the feature's importance.

How I Implemented this

  1. Fit a random forest model on the data and compute the OOB score (since I used a Random Forest, there is no need for validation data) - this is the baseline score.
  2. Drop one column, retrain, and recompute the OOB score.
  3. The difference between the baseline and the drop-column score gives the feature importance value.
  4. Repeat the step for each column to get the drop-column feature importances.
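The steps above can be sketched as follows, using scikit-learn's RandomForestRegressor with its OOB R^2 as the metric (the function name and hyperparameters are illustrative):

```python
# Sketch of drop-column importance: baseline OOB score minus the OOB score
# obtained after dropping each column and retraining.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def drop_column_importances(X: pd.DataFrame, y, random_state=0) -> pd.Series:
    def oob(X_subset):
        rf = RandomForestRegressor(n_estimators=100, oob_score=True,
                                   random_state=random_state)
        rf.fit(X_subset, y)
        return rf.oob_score_           # OOB R^2, no validation set needed

    baseline = oob(X)                  # step 1: baseline score
    importances = {col: baseline - oob(X.drop(columns=[col]))   # steps 2-4
                   for col in X.columns}
    return pd.Series(importances).sort_values(ascending=False)
```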
In [6]:
plot_drop(X,y)

The issue with drop column importance

  1. Since the model is retrained for each feature, this method is computationally expensive.
  2. When we have collinear features, a dropped feature's importance can fall to 0, because the remaining collinear features compensate for it.

Permutation importance

The concept here is that a column is shuffled randomly so as to destroy its relationship with the target variable. A large drop from the baseline (the original model with the column intact) indicates that the column is in fact important. However, if there is not much of a difference, the feature is not predictive. Since we don't retrain the model, this is more efficient than drop column.

How I Implemented this

  1. Fit a random forest model on the training data and compute the validation R2 - this is the baseline score.
  2. Randomly shuffle one column and recompute the validation score (the model is not retrained).
  3. The difference between the baseline and the new metric is the permutation importance.
  4. Repeat for all the columns.
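The steps above can be sketched as follows: fit once, then shuffle one validation column at a time and measure the drop in validation R^2 (the function name is illustrative, not the one in featimp.py):

```python
# Sketch of permutation importance: the model is fit once and never retrained.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def permutation_importances(X: pd.DataFrame, y, random_state=0) -> pd.Series:
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=random_state)
    rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rf.fit(X_tr, y_tr)
    baseline = rf.score(X_va, y_va)        # step 1: validation R^2 baseline

    rng = np.random.default_rng(random_state)
    importances = {}
    for col in X.columns:                  # steps 2-4, no retraining
        X_shuffled = X_va.copy()
        X_shuffled[col] = rng.permutation(X_shuffled[col].values)
        importances[col] = baseline - rf.score(X_shuffled, y_va)
    return pd.Series(importances).sort_values(ascending=False)
```

Shuffling only the validation copy leaves the fitted model untouched, which is what makes this method so much cheaper than drop-column.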
In [7]:
plot_permute(X,y)

The issue with permutation importance

Unlike drop-column importance, with permutation importance codependent features share the importance and don't zero out. However, the importance can sometimes be over-estimated when codependent features are present.

Existing packages

There are many good packages in Python for feature importance visualization, like:

  1. rfpimp
  2. shap
  3. sklearn default
  4. eli5
  5. treeinterpreter
  6. LIME

Out of these, I found the shap package to have impressive visualizations, and below I tried it on my dataset with the RF model.

In [8]:
import shap
# load JS visualization code to notebook
shap.initjs()

# train the Random Forest model
X,y = shap.datasets.boston()
rf.fit(X,y)

# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
Out[8]:
(SHAP force plot for the first prediction; rendered via Javascript and not shown in this static export)
In [9]:
# summarize the effects of all the features
shap.summary_plot(shap_values, X)
In [10]:
shap.summary_plot(shap_values, X, plot_type="bar")

Comparison of various strategies

Below, I have selected the top 7 features based on:

  1. Spearman's correlation
  2. mRMR
  3. Drop column importance
  4. Permutation importance

Using these features, I then compared the validation metric across the different methods. I tried this with 2 models - RF and OLS. The plots are below. I observed that drop-column, permutation importance, and Spearman's correlation selected the same features, hence giving the same graphs.

In [11]:
# Random Forest
plot_compare(rf)
In [12]:
# OLS
plot_compare(ols)

Automatic feature selection

To automatically select features, I used the drop-column importance implemented earlier. I computed my initial baseline (OOB) score, then dropped the least important feature and retrained the model to get the validation score. Once the validation score falls below the baseline, the process ends. To visualize this, I have plotted the validation scores. I then chose the feature set with the maximum validation score.
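The selection loop might be sketched as follows (function names are illustrative, not those in featimp.py; this variant simply runs until one feature remains and keeps the best-scoring feature set along the way):

```python
# Sketch of backward elimination driven by drop-column importance:
# repeatedly remove the feature whose removal hurts the OOB score least.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def oob_r2(X: pd.DataFrame, y) -> float:
    rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
    rf.fit(X, y)
    return rf.oob_score_

def auto_select(X: pd.DataFrame, y):
    features = list(X.columns)
    best_score, best_features = oob_r2(X[features], y), list(features)
    while len(features) > 1:
        baseline = oob_r2(X[features], y)
        # least important feature = smallest drop in OOB score when removed
        worst = min(features, key=lambda c: baseline -
                    oob_r2(X[[f for f in features if f != c]], y))
        features.remove(worst)
        score = oob_r2(X[features], y)
        if score > best_score:               # remember the best feature set seen
            best_score, best_features = score, list(features)
    return best_features
```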

In [13]:
plot_auto()

The top features, selected so that the metric is maximized, are:

In [14]:
auto(X,y)
Out[14]:
Top k Features
0 LSTAT
1 RM
2 DIS
3 NOX
4 TAX
5 INDUS
6 PTRATIO
7 CHAS
8 ZN
9 B

Variance for feature importances

By bootstrapping the data and retraining the model, we can get the variance in importances.

For this implementation, I used the drop-column importance and visualized the standard deviation for each feature.
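The bootstrap loop can be sketched as follows (names are illustrative; any importance function with the signature `importance_fn(X, y) -> pd.Series`, such as a drop-column implementation, can be plugged in):

```python
# Sketch of bootstrapping importance estimates: resample rows with replacement,
# recompute the importances each time, and report per-feature mean and std.
import numpy as np
import pandas as pd

def importance_variance(X: pd.DataFrame, y, importance_fn, n_boot=20,
                        random_state=0):
    rng = np.random.default_rng(random_state)
    y = pd.Series(np.asarray(y))
    runs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap row sample
        runs.append(importance_fn(X.iloc[idx].reset_index(drop=True),
                                  y.iloc[idx].reset_index(drop=True)))
    runs = pd.DataFrame(runs)                        # one row per bootstrap run
    return runs.mean(), runs.std()                   # per-feature mean and std
```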

In [15]:
plot_var(X,y)

Conclusion

The following were the key takeaways from this project for me:

  1. Feature importance is a great tool to interpret machine learning models and derive business insights.

  2. Feature importances from weak or high-bias models cannot be trusted.

  3. Importances must be computed on a validation set, not the training set (or via the OOB score in the case of RF).

  4. Importances may vary from model to model and cannot be reused when training different models.

  5. By selecting features and identifying their importance, we can explain or interpret ML models for business stakeholders. Visualizing them is a good way to communicate the results.
